Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian
Authors
Abstract
We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when the language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
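To make the class-based idea in the abstract concrete, the following is a minimal sketch of a class-based bigram model, which factors the word probability as P(w | w_prev) = P(c | c_prev) * P(w | c). It is an illustration only: the toy Finnish sentences, the hand-assigned class labels (ADJ, NOUN), and the function names `train_class_bigram` and `class_bigram_prob` are assumptions made for this example, and it uses unsmoothed maximum-likelihood counts rather than the morphological classes or the merging, splitting and exchange procedures studied in the paper.

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Collect unsmoothed counts for a class-based bigram model."""
    bigram = Counter()      # count(c_prev, c)
    history = Counter()     # count(c_prev) as a bigram history
    emit = Counter()        # count(w, c)
    class_total = Counter() # count(c) as a predicted class
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        # Tokens without a class (here: sentence boundaries) form their own class.
        classes = [word2class.get(t, t) for t in tokens]
        for i in range(1, len(tokens)):
            bigram[(classes[i - 1], classes[i])] += 1
            history[classes[i - 1]] += 1
            emit[(tokens[i], classes[i])] += 1
            class_total[classes[i]] += 1
    return bigram, history, emit, class_total

def class_bigram_prob(word, prev, model, word2class):
    """P(word | prev) = P(class(word) | class(prev)) * P(word | class(word))."""
    bigram, history, emit, class_total = model
    c = word2class.get(word, word)
    c_prev = word2class.get(prev, prev)
    if history[c_prev] == 0 or class_total[c] == 0:
        return 0.0
    p_class = bigram[(c_prev, c)] / history[c_prev]
    p_word = emit[(word, c)] / class_total[c]
    return p_class * p_word

# Hypothetical toy data: "suuri talo" (big house), "pieni auto" (small car).
word2class = {"suuri": "ADJ", "pieni": "ADJ", "talo": "NOUN", "auto": "NOUN"}
sentences = [["suuri", "talo"], ["pieni", "auto"]]
model = train_class_bigram(sentences, word2class)
p_unseen = class_bigram_prob("auto", "suuri", model, word2class)
```

In this toy setup the word pair "suuri auto" never occurs in the training sentences, so an unsmoothed word bigram would assign it zero probability, whereas the class-based estimate `p_unseen` is 0.5 because the ADJ to NOUN transition is shared across words. This sharing is the sense in which classes alleviate data sparsity; the number of transition parameters also depends on the number of classes rather than on the vocabulary size.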
Similar resources
Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes
This paper describes the development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large; in addition, the word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...
Large Vocabulary Continuous Speech Recognition for Estonian Using Morphemes and Classes
This paper describes the development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large; in addition, the word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...
Estonian Large Vocabulary Speech Recognition System for Radiology
This paper describes implementation and evaluation of an Estonian large vocabulary continuous speech recognition system prototype for the radiology domain. We used a 44 million word corpus of radiology reports to build a word trigram language model. We recorded a test set of dictated radiology reports using ten radiologists. Using speaker independent speech recognition, we achieved a 9.8% word ...
Towards very large vocabulary word recognition
In this paper, preliminary considerations and some experimental results are presented in an effort to design Very Large Vocabulary Recognition (VLVR) systems. We will first consider the applicability of current recognition techniques and argue their inadequacy for VLVR. Possible alternate strategies will be explored and their potential usefulness statistically evaluated. Our results indicate t...
Vocabulary Decomposition for Estonian Open Vocabulary Speech Recognition
Speech recognition in many morphologically rich languages suffers from a very high out-of-vocabulary (OOV) ratio. Earlier work has shown that vocabulary decomposition methods can practically solve this problem for a subset of these languages. This paper compares various vocabulary decomposition approaches to open vocabulary speech recognition, using Estonian speech recognition as a benchmark. C...
Journal
Journal title: Computer Speech & Language
Year: 2021
ISSN: 1095-8363, 0885-2308
DOI: https://doi.org/10.1016/j.csl.2020.101141